Improving Effective Bandwidth through Compiler Enhancement of Global Cache Reuse
Authors
Abstract
Reusing data in cache is critical to achieving high performance on modern machines, because it reduces the impact of the latency and bandwidth limitations of direct memory access. To date, most studies of software memory hierarchy management have focused on the latency problem in loops. However, today's machines are increasingly limited by insufficient memory bandwidth; latency-oriented techniques are inadequate because they do not seek to minimize the amount of data transferred from memory over the whole program. To address the bandwidth limitation, this paper explores the potential for global cache reuse, that is, reusing data across loop nests and over the entire program. In particular, the paper investigates a two-step strategy. The first step fuses computations on the same data to enable the caching of repeated accesses. The second step groups data used by the same computation to make them contiguous in memory. While the first step reduces the frequency of memory access, the second step improves its efficiency. The paper demonstrates the effectiveness of this strategy and shows how to automate it in a production compiler.
Similar resources
Improving Effective Bandwidth through Compiler Enhancement of Global and Dynamic Cache Reuse
While CPU speed has been improved by a factor of 6400 over the past twenty years, memory bandwidth has increased by a factor of only 139 during the same period. Consequently, on modern machines the limited data supply simply cannot keep a CPU busy, and applications often utilize only a few percent of peak CPU performance. The hardware solution, which provides layers of high-bandwidth data cache...
Full text
Beyond Reuse Distance Analysis: Dynamic Analysis for Characterization of Data Locality Potential
Emerging computer architectures will feature drastically decreased flops/byte (ratio of peak processing rate to memory bandwidth) as highlighted by recent studies on Exascale architectural trends. Further, flops are getting cheaper while the energy cost of data movement is increasingly dominant. The understanding and characterization of data locality properties of computations is critical in or...
Full text
Cache-Partitioned Tiling for Data Reuse Across Loop Nests
This paper presents cache-partitioned tiling, a systematic and integrated approach for global optimization of cache locality across multiple loop nests which reference multiple arrays. The approach is based on the idea of cache partitioning, in which the cache capacity is divided into a number of equal-sized sections. A data layout in memory is derived to eliminate cache conflicts by ensuring th...
Full text
Program Transformations for Cache Locality Enhancement on Shared-Memory Multiprocessors
Program Transformations for Cache Locality Enhancement on Shared-memory Multiprocessors Naraig Manjikian Doctor of Philosophy Graduate Department of Electrical and Computer Engineering University of Toronto 1997 This dissertation proposes and evaluates compiler techniques that enhance cache locality and consequently improve the performance of parallel applications on shared-memory multiprocesso...
Full text
Reducing Memory Bandwidth Consumption Via Compiler-Driven Selective Sub-Blocking
As processors continue to deliver higher levels of performance and as memory latency tolerance techniques become widespread to address the increasing cost of accessing memory, memory bandwidth will emerge as a major performance bottleneck. Rather than rely solely on wider and faster memories to address memory bandwidth shortages, an alternative is to use existing memory bandwidth more efficient...
Full text
Journal title:
Volume Issue
Pages -
Publication year: 2001